working  paper 
department 
of  economics 


ON A GENERAL APPROACH TO SEARCH AND
INFORMATION GATHERING

Kevin  W.S.  Roberts 
Martin  L.  Weitzman 


Number  263 


August,  1980 


massachusetts 
institute  of 

technology 

50  memorial  drive 
Cambridge,  mass.  02139 


Summary

We  show  that  an  important  class  of  multi-stage  decision  problems, 
of  which  conventional  search  theory  is  a  special  case,  can  be  formulated 
and  constructively  solved  within  a  unified  framework.   The  optimal 
strategy  is  an  elementary  reservation  price  rule,  allowing  an  intuitive 
economic  interpretation  and  permitting  problems  to  be  solved  in  polynomial 
rather  than  exponential  time.   Computationally  efficient  algorithms 
are  presented  which  can  be  used  to  numerically  calculate  reservation 
prices  in  real  situations.   We  investigate  the  qualitative  properties 
of  an  optimal  policy,  analyze  how  they  depend  on  various  underlying 
economic  features  of  the  problem,  and  note  what  they  imply  about  optimal 
decisions  in  different  contexts. 

Introduction

A  broad  class  of  dynamic  allocation  models  can  be  roughly  described 
as  follows.   A  decision  maker  has  some  number  of  activities,  projects, 
or opportunities from among which he chooses one for further development.
Depending  upon  the  project  selected  some  reward  is  received  and  the  state 
of  that  project  is  possibly  altered,  while  the  other  projects  remain 
intact  in  their  previous  condition.   The  basic  problem  is  to  choose 
opportunities at each decision time to maximize expected present
discounted value. While complete characterization will have to await the
introduction of formal notation, suffice it here to note that there are
many useful applications.

Although the basic philosophy of this paper argues that such
models are best viewed as multi-stage generalizations of search
problems, we adhere to an existing nomenclature in the statistics
literature which has classified several examples as so-called "bandit
processes" by analogy with the multi-armed bandit problem.

This  paper  has  several  objectives.  After  presenting  a  discrete- 
state  framework  for  formulating  bandit  processes,  we  show  that  a 
rather  impressive  variety  of  dynamic  allocation  problems  from  seemingly 
unrelated  areas  of  economics,  statistics,  and  operations  research  can 
be  cast  in  that  form.  Then  we  prove  the  existence  of  a  general, 
constructive  method  for  solving  bandit  processes. 

The  solution,  which  by  all  reckoning  should  be  complicated  to 
state  and  very  difficult  to  solve,  can  in  fact  be  characterized  by  an 
elementary  rule  familiar  to  economists.  Each  stage  of  a  project  is 
assigned a reservation price, a critical number analogous to an
internal rate of return, depending only on the project and its stage,
independent  of  all  other  projects,  and  possessing  a  simple,  intuitive 
economic  interpretation.  The  optimal  rule  is  to  proceed  next  with  that 
activity,  project,  or  opportunity  having  the  highest  reservation  price. 


¹ In previous versions of this paper, before the work of Gittens and
others in the statistics literature was brought to our attention, we
called our model a "multi-stage search process". We still prefer
this description, but of course it would now be confusing to persist
with our own terminology.

Since knowing reservation prices is tantamount to solving a
bandit  process,  we  can  understand  the  basic  qualitative  features  of 
an  optimal  policy  by  analyzing  the  main  factors  determining  a  project's 
reservation  price  —  like  profitability,  uncertainty,  information, 
learning,  flexibility,  degree  of  increasing  and  decreasing  returns,  etc. 
In  the  context  of  various  models,  we  try  to  explain  why  reservation 
prices  differ  between  projects  and  how  a  reservation  price  is  likely 
to  change  as  a  project  is  developed. 

Not  only  can  reservation  prices  be  used  to  characterize  the  form 
of  an  optimal  policy,  but  they  allow  any  bandit  process  to  be  solved 
in  polynomial  rather  than  exponential  time.   We  show  that  the  theory 
of  discrete  state  reservation  prices  is  essentially  constructive. 
Reservation prices satisfy dynamic programming equations of a form that
allows them to be numerically calculated by iterative algorithms with
known  and  powerful  convergence  properties.   This  should  make  it 
computationally  feasible  to  actually  solve  real  world  problems  of  a 
surprisingly  large  size. 

The history of the problem is complicated to trace because a
great many examples of bandit processes have been effectively treated
without real awareness of the underlying connection between them or
of the existence of a unified theory.² To Gittens and some other
statisticians belongs the credit for formulating the first truly
general mathematical statement of the model, while primary emphasis
has been on solving problems in statistical design akin to the
multi-armed bandit problem.³ It seems fair to say that only quite recently
have people started to become aware of the full range of problems
covered by this kind of theory.

² Some of these examples will be presented in a later section.

To  the  economist,  bandit  processes  are  important  because  they 
form  an  elegant  and  operational  theory  which  nicely  captures  the  role 
of information gathering in dynamic resource allocation. Within
economics  this  role  has  been  played  largely  by  search  models,  of  which 
bandit  processes  are  a  powerful  generalization. 


The  Model 

There  are  N  opportunities  or  projects,  indexed  n  =  1,2,...,N. 
At  any  decision  time  exactly  one  project  of  the  N  must  be  selected 
for further development. Let project n be in state i. If project
n is chosen, (expected) reward $R_i^n$ is collected and project n makes a
transition from state i to state j with probability $P_{ij}^n$. Every other
project remains locked in its previous state. A discount factor $\beta_i^n$
is applied to future returns.⁴


³ See Gittens [1979], Whittle [1980], and the references cited therein.
The essential results of the present paper were independently stated,
proved, and written in an earlier version before we became aware of
Gittens' pioneering work. So far as we know, the connection between
search theory and bandit processes has not been treated anywhere in
the literature, and it is primarily this aspect which is most important
in economic applications. The search theory (reservation price)
approach to bandit processes is also, in our opinion, the most intuitively
appealing way to understand the dynamic programming conditions and to
prove the basic theorem.

⁴ A superscript on a variable indexes the project and does not mean raising
the variable to that exponent.

Thus, if project n is selected, the system as a whole moves
from state

$$S = \times_{m=1}^{N} i(m) \tag{1}$$

to state

$$S - >i(n)< + >j< \tag{2}$$

with probability $P_{ij}^n$ for i = i(n), where the notation S - >i(n)< + >j< means a
state identical to S except that project n is in state j instead of
i(n).

The problem of selecting a project to maximize expected present
discounted value can be posed in dynamic programming format. Let $\Psi(S)$
represent the expected present discounted value of following an optimal
policy from this time on when the state of the system is S.

For each S, the state valuation functions $\Psi$ must satisfy the
fundamental recursive relation

$$\Psi(S) = \max_{1 \le n \le N} \Bigl\{ R_i^n + \beta_i^n \sum_j P_{ij}^n \, \Psi(S - >i(n)< + >j<) \Bigr\} \tag{3}$$

where i = i(n) inside the braces.

In principle the state valuation functions $\{\Psi(S)\}$ might be
recursively built up by iterative backwards induction on the stages
of each project, using equation (3). In most actual cases the
computation would be a brute force task of horrendous proportions
because the "curse of dimensionality" is likely to be so strong.

At any state S, the optimal project to select, n*(S), is that
alternative which maximizes the right hand side of (3). If two or more
alternatives tie, it makes no difference how the tie is broken. Note
that although an optimal strategy is implicitly contained in equation (3),
the form of that strategy is nothing more than a complete enumeration
of what to do in all possible situations, with no visible economic
or other interpretation.
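
To make the size of the direct approach concrete, the recursion (3) can simply be iterated over every joint state. The following Python sketch is only an illustration under stated assumptions, not a procedure taken from the paper: the function name `optimal_values`, the dictionary layout of a project, a single common discount factor `beta`, the fixed number of sweeps, and the two example projects are all hypothetical.

```python
import itertools

def optimal_values(projects, beta, sweeps=500):
    """Iterate equation (3) over every joint state S (brute force).

    projects: list of dicts with "R" (reward by state) and "P" (transition
    rows by state). beta: one common discount factor, an assumption made
    here for simplicity (the text lets it vary by project and state).
    """
    sizes = [len(p["R"]) for p in projects]
    joint_states = list(itertools.product(*(range(s) for s in sizes)))
    psi = {s: 0.0 for s in joint_states}
    for _ in range(sweeps):
        new_psi = {}
        for s in joint_states:
            candidates = []
            for n, p in enumerate(projects):
                i = s[n]
                # Only project n moves; every other project stays locked.
                expected = sum(
                    p["P"][i][j] * psi[s[:n] + (j,) + s[n + 1:]]
                    for j in range(sizes[n])
                )
                candidates.append(p["R"][i] + beta * expected)
            new_psi[s] = max(candidates)
        psi = new_psi
    return psi

# Hypothetical example: a risky three-state project and a safe one-state project.
risky = {"R": [-1.0, 0.0, 10.0],
         "P": [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}
safe = {"R": [3.0], "P": [[1.0]]}
psi = optimal_values([risky, safe], beta=0.9)
print(len(psi), round(psi[(0, 0)], 6))
```

The dictionary `psi` has one entry per joint state, so its size is the product of the projects' state counts. That product is what makes this route impractical beyond small examples.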

In  the  solution  of  the  standard  search  model,  a  reservation  price 
equal to its certainty equivalent is assigned to each box. The
reservation price of a closed box is that hypothetical cutoff value of a
deterministic  fallback  reward  which  would  make  it  just  equal  to  the 
expected  net  gain  of  opening  the  box  and  having  the  certain  reward  to 
fall  back  upon.    In  an  optimal  policy,  that  box  with  highest  reservation 
price  is  opened  next. 

The  contribution  of  the  present  paper  is  to  show  that  essentially 
the  same  idea  works  for  multi-stage  search  problems.  A  project  in  any 
stage  is  assigned  a  reservation  price,  calculated  in  an  analogous  manner 
to  the  standard  search  model.  The  reservation  price  of  a  project-stage 
determines  its  ordinal  ranking,  telling  when  to  fund  this  project-stage 
relative  to  other  project  stages.  Thus,  all  the  advantages  of  a  simple 
rate  of  return  criterion  apply  in  the  context  of  search  with  accumulated 
information. 

Examples 

We  exhibit  some  examples  of  bandit  processes  from  three  broad 
application  areas. 


⁵ See Weitzman [1979], Lippman and McCall [1976].

(1) Search

Simple box search is the prototype problem of the present paper.
It  will  become  apparent  why  we  have  chosen  this  model  as  our  primary 
conceptual  antecedent  when  we  examine  the  solution  equations  for  the 
general  bandit  process. 

Suppose  there  is  a  collection  of  closed  boxes,  not  necessarily 
identical.   Each  box  contains  a  potential  reward  sampled  from  a 
probability  distribution.   At  some  cost,  and  after  an  appropriately 
discounted  waiting  interval,  a  box  can  be  opened  and  its  contents  made 
known. 

At  each  decision  node,  the  decision  maker  must  decide  whether 
or  not  to  open  a  box.   If  search  is  terminated,  the  maximum  reward 
thus  far  uncovered  is  collected.   If  search  is  continued,  the  decision 
maker must select the next box to be opened, pay at that time the fee
for  opening  it,  and  wait  for  the  outcome.   Then  will  come  the  next 
decision  node.   The  object  is  to  find  a  strategy  which  maximizes 
expected  present  discounted  value. 

Suppose that box n contains a potential reward $Y_i^n$ with
probability $q_i^n$ for i = 1, 2, .... It costs $C^n$ to open box n and learn its
contents, which become known after a time lag reflected by the discount
factor $\beta^n$.

⁶ See Weitzman [1979], Lippman and McCall [1976], and the references cited
therein.

Simple box search can be posed as a bandit process using the
following notation. In state 0 the box is closed and $P_{0i}^n = q_i^n$,
$R_0^n = -C^n$, $\beta_0^n = \beta^n$. If the box is opened and in state $i \ge 1$, we say
it is in an absorbing state with $P_{ii}^n = 1$, $R_i^n = Y_i^n$, $\beta_i^n = 0$.

Further specialized cases of simple box search, like locating
an object, or the gold mining problem, are a fortiori examples of bandit
processes.⁷
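
The box-to-bandit translation just described can be written out as arrays. This is a minimal sketch under assumptions: the helper name `box_as_bandit`, the use of NumPy, and the numbers in the example call are hypothetical and serve only to show the layout of rewards, discount factors, and transition probabilities.

```python
import numpy as np

def box_as_bandit(rewards, probs, cost, discount):
    """Return (R, beta, P) for one box, with state 0 = closed.

    States 1..m are the absorbing "opened" states, one per possible reward.
    """
    rewards = np.asarray(rewards, dtype=float)
    probs = np.asarray(probs, dtype=float)
    m = len(rewards)
    R = np.concatenate(([-cost], rewards))             # R_0 = -C, R_i = Y_i
    beta = np.concatenate(([discount], np.zeros(m)))   # beta_0 = beta, beta_i = 0
    P = np.zeros((m + 1, m + 1))
    P[0, 1:] = probs                                   # P_0i = q_i
    idx = np.arange(1, m + 1)
    P[idx, idx] = 1.0                                  # P_ii = 1 (absorbing)
    return R, beta, P

# Hypothetical box: costs 1 to open, pays 0 or 10 with equal probability.
R, beta, P = box_as_bandit(rewards=[0.0, 10.0], probs=[0.5, 0.5],
                           cost=1.0, discount=0.9)
print(R, beta, P, sep="\n")
```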

(2)   Scheduling 

The so-called "resource pool problem" is the simplest case of a
non-trivial deterministic bandit process.⁸ It highlights the pure
scheduling or fitting aspect of a bandit process in its most direct
form, free of search, learning, or other features.

At  each  instant  of  time  a  depletable  resource  can  be  drawn  from 
any  one  of  a  number  of  pools.   The  cost  of  removing  an  extra  unit  from 
a  pool  depends  on  how  much  has  already  been  taken  out  of  it.   What 
policy  supplies  a  fixed  flow  of  the  resource  at  minimum  present  discounted 
cost? 

⁷ See Kadane and Simon [1977], Kadane [1969], and the references cited
therein.

⁸ See Weitzman [1975].

If i units have been drawn from pool n, it costs $C_i^n$ to extract the
next unit. Then $P_{i,i+1}^n = 1$, $R_i^n = -C_i^n$, $\beta_i^n = \alpha$ defines a resource pool
problem as a bandit process. The non-decreasing cost case $C_i^n \le C_{i+1}^n$
is the familiar situation where a myopic marginalist rule is optimal.
In the more interesting case, the resource pool problem confronts the
issue of evaluating situations with a range of decreasing costs.
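
For the non-decreasing cost case, the myopic marginalist rule can be sketched in a few lines of Python. The function name `myopic_schedule`, the list-of-lists cost layout, and the example costs are assumptions for illustration. The sketch covers only the case in which each pool's marginal costs never fall, and does not address the harder case of decreasing costs.

```python
def myopic_schedule(costs, units):
    """Draw `units` units, each time from the pool whose next unit is cheapest.

    costs[n][i] is the cost of the next unit from pool n after i units have
    been drawn; every list is assumed to be non-decreasing.
    """
    drawn = [0] * len(costs)
    order = []
    for _ in range(units):
        available = [n for n in range(len(costs)) if drawn[n] < len(costs[n])]
        if not available:
            break  # every pool is exhausted
        n = min(available, key=lambda k: costs[k][drawn[k]])
        order.append(n)
        drawn[n] += 1
    return order

# Hypothetical marginal cost schedules for two pools.
order = myopic_schedule([[1, 4, 6], [2, 3, 7]], units=4)
print(order)  # [0, 1, 1, 0]
```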

Obviously  uncertainty  can  be  introduced  into  a  resource  pool 
problem  without  disturbing  its  status  as  a  bandit  process  (although 
it  will  typically  be  somewhat  harder  to  solve).   Just  as  one  example, 
suppose that pool n costs a fixed overhead charge of $K^n$ to open up and
has constant marginal cost $C^n$ thereafter, but reserves are uncertain.
A  similarly  structured  problem  is  the  task  of  scheduling  a  number  of 
jobs  to  be  carried  out  by  a  single  machine  when  the  jobs  differ  in  value, 
and  completion  times  are  random  variables. 

Actually,  with  a  judicious  interpretation  of  "resource  pool",  a 
bandit  process  formulation  is  sufficiently  general  to  include  as  special 
cases  many  of  the  standard  operations  research  models  for  such  problems 
as  equipment  durability  selection  and  replacement,  inventory,  maintenance, 
and  production  scheduling,  or  capacity  expansion.   In  such  situations 
there  are  typically  several  classes  of  pools,  each  of  which  contains 
an  infinite  number  of  identical  members.   For  equipment  problems  a 
"resource  pool"  is  a  certain  piece  of  equipment  and  the  "amount  extracted" 
is  the  length  of  time  it  has  been  in  service.   With  scheduling, 
inventory,  and  expansion  problems,  a  pool  is  a  certain  strategy-schedule 
(like  ordering  inventory,  or  building  capacity)  which  goes  up  to  some 
expiration or regeneration point (after which costs are interpreted as
being infinite); the "amount extracted" is the length of time the given
strategy-schedule has been carried out.

(3)   Learning  (or  Information  Gathering) 

The  following  numerical  example  conveys  the  flavor  of  multi-stage 
learning  as  a  generalization  of  search. 

Suppose  the  research  department  of  a  large  organization  has  been 
assigned  the  task  of  developing  some  new  product.   Two  independent 
technologies  are  being  considered,  both  of  which  are  uncertain. 
Because  they  both  produce  the  same  product,  no  more  than  one  technology 
would  actually  be  used  even  if  both  were  successfully  developed. 

For  each  technology,  research  goes  through  two  stages.   First, 
a  preliminary  feasibility  study  is  made.   If  the  outcome  of  the 
feasibility  study  is  unfavorable,  the  technology  has  no  chance  of  success. 
If the feasibility study is favorable, the technology may be successful,
but  this  can  only  be  determined  after  mounting  a  full  scale  R&D 
effort. 

Table  1  summarizes  the  relevant  information. 

TABLE 1

                                          Project A    Project B

Probability of Success Estimated
Before Feasibility Study                     .5           .4

Cost of Feasibility Study                [illegible]  [illegible]

Probability of Passing
Feasibility Study                            .8           .5

Cost of Full Scale R&D                       21           24

For which technology, A or B, should a feasibility study be
ordered? We discuss the answer in the applications section.

General multi-stage problems of this sort can be written as
bandit processes. The corresponding bandit process is typically a
unidirectional branching tree with terminal absorbing states;
if $P_{ij} > 0$, then either j > i or j = i and $P_{ii} = 1$. In the previous
example, the terminal absorbing states offer as reward either some
large  unspecified  constant  (success),  or  zero  (failure).   Rewards  in 
a  transition  state  are  the  negative  of  development  costs  at  that 
stage. 

In previous work, we have modeled a research, development or
exploration project where the potential reward, which can only be collected
after all development work has been completed, is viewed as a sum of
independent random variables across component development stages.⁹
As additional research money is paid to develop another stage, the
"contribution" of that stage to the final reward becomes known. Hence,
the distribution of final rewards is continually shifting as the
contribution of each stage turns out better or worse than expected; and the
distribution narrows with development because less uncertainty remains
to be resolved. Research costs are paid both to move the project towards
completion and to find out more information about potential rewards.
If the decision maker has paid all development costs leading up to the
final stage, development uncertainty has been eliminated and the
reward can be collected. At any stage the project can be abandoned
and, viewed ex post, the previously sunk development costs have nothing
to show. A collection of such projects is an example of a bandit
process.

⁹ See Roberts and Weitzman [1979]. The model is solved and analyzed
by continuous time methods.

Perhaps the most classic model of learning is the multi-armed
bandit problem which served Gittens and his co-workers as a prototype
example.¹⁰ At any stage the decision maker has an estimate of the
distribution of success probabilities for each arm, which in our
language  is  the  state  of  the  arm.   When  an  arm  is  played,  some  reward 
is  expected  and  depending  on  what  is  actually  received,  the  estimate 
(or  state)  is  updated  in  Bayesian  fashion.   The  multi-armed  bandit 
problem  embodies  a  classical  trade-off  between  taking  high  expected 
rewards  now  and  acquiring  information  which  may  be  valuable  later. 

It  should  be  obvious  that  many  economic  aspects  of  "learning  by 
doing"  can  be  modeled  as  a  bandit  process. 

The  Basic  Theorem 

Consider the following functional equation:

$$V_i^n(Z) = \max\Bigl\{ Z,\; R_i^n + \beta_i^n \sum_j P_{ij}^n V_j^n(Z) \Bigr\} \qquad \forall i \tag{4}$$

Under very weak conditions, (4) will have a unique solution for each Z.

¹⁰ See Gittens [1979], Whittle [1980], and the references cited therein.
For applications to economics, see Rothschild [1974] and the references
he cites.

$V_i^n(Z)$ possesses an important economic interpretation. Consider
the artificial bandit process where one of the projects is n in state i
and the other is a fallback lump sum reward Z which can be collected at
any time, whereupon the entire process must be discontinued. $V_i^n(Z)$
is the expected present discounted value of being in state i of project
n and following an optimal stopping rule when the consolation prize is
Z. The difference $V_i^n(Z) - Z$ might be called the "option value" of being
in i(n).
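
The functional equation (4) can be solved numerically for a single project by repeated substitution. The following Python sketch is only an illustration: the name `stopping_value`, the tolerance, and the three-state box used as the example (cost 1, discount factor 0.9, reward 0 or 10 with equal probability) are assumptions, and the iteration is assumed to converge because every discount factor is below one.

```python
import numpy as np

def stopping_value(R, beta, P, Z, tol=1e-12, max_iter=100_000):
    """Iterate V <- max{Z, R + beta * (P @ V)}, i.e. equation (4), to a fixed point."""
    R, beta, P = (np.asarray(a, dtype=float) for a in (R, beta, P))
    V = np.full(len(R), float(Z))
    for _ in range(max_iter):
        V_new = np.maximum(Z, R + beta * (P @ V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    raise RuntimeError("no convergence; check that all discount factors are below 1")

# Hypothetical box in bandit form: state 0 closed, states 1 and 2 opened.
R = [-1.0, 0.0, 10.0]
beta = [0.9, 0.0, 0.0]
P = [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

for Z in (0.0, 8.0):
    V = stopping_value(R, beta, P, Z)
    print(Z, V[0], V[0] - Z)  # fallback, value of the closed box, option value
```

With a fallback of 0 the closed box is worth opening and has a positive option value; with a fallback of 8 the option value is zero and the fallback is taken at once.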

A fixed point of $V_i^n(Z)$ which will play an indispensable role is
$Z_i^n$, defined to satisfy

$$Z_i^n = V_i^n(Z_i^n) = R_i^n + \beta_i^n \sum_j P_{ij}^n V_j^n(Z_i^n). \tag{5}$$

It is not hard to prove existence and uniqueness of $Z_i^n$. It is also
not difficult to show that

$$V_i^n(Z) = Z \quad \text{for } Z \ge Z_i^n$$

$$V_i^n(Z) > Z \quad \text{for } Z < Z_i^n$$

Now $Z_i^n$ has two important interpretations and one truly
extraordinary property.

An interesting interpretation is that $Z_i^n$ represents that value of
the fallback position which would make a decision maker just indifferent
between continuing project n at stage i and abandoning it in favor of
taking the fallback reward immediately. In economist's terminology,
$Z_i^n$ is a reservation price.

An equally valuable way of comprehending $Z_i^n$ is to note that it
represents the expected present discounted value of an optimal policy
for a bandit process consisting of an infinite number of projects, all
of type n in state i. This interpretation follows from the fact that
$Z_i^n$ so defined must satisfy (5).

The extraordinary property is that $Z_i^n$ contains all relevant
information about project n in state i for any bandit process of which it is
a member. The optimal rule is to proceed next with that project-state
having the highest reservation price. This unusual feature can lead
to striking results and powerful characterizations of an optimal policy.
Note that $Z_i^n$ does not at all represent the value of project n in any
traditional economic sense; there is a crucial distinction between the
marginal value of a project and the order in which it should be
undertaken.

Adopting the notation

$$V_{ji}^n = V_j^n(Z_i^n), \tag{6}$$

a standard contraction mapping argument shows that the system of
equations

$$V_{ii}^n = R_i^n + \beta_i^n \sum_j P_{ij}^n V_{ji}^n \tag{7}$$

$$V_{ji}^n = \max\Bigl\{ V_{ii}^n,\; R_j^n + \beta_j^n \sum_k P_{jk}^n V_{ki}^n \Bigr\} \tag{8}$$

has a unique solution if $0 \le \beta_i^n < 1$.

The following theorem is the basic result of this paper.

Theorem: The optimal policy in state $S = \times_{n=1}^{N} i(n)$
is to select the project n* for which

$$Z^{n*} = \max_n Z^n, \tag{9}$$

where

$$Z^n = V_{i(n),i(n)}^n.$$
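
The rule in the theorem is easy to state in code once reservation prices are available. The Python sketch below takes a table of reservation prices as given and evaluates the resulting policy on a small hypothetical example. The names `index_policy_value`, `risky`, and `safe`, the common discount factor 0.9, and the numbers are assumptions. The reservation prices 80, 0, 100 (risky project) and 30 (safe project) were worked out by hand from (7), (8) for this example only.

```python
import itertools

def index_policy_value(projects, Z, beta, sweeps=500):
    """Evaluate the rule "select the project with the highest reservation price".

    Z[n][i] is the reservation price of project n in state i (taken as given).
    Returns (value by joint state, chosen project by joint state).
    """
    sizes = [len(p["R"]) for p in projects]
    joint_states = list(itertools.product(*(range(s) for s in sizes)))
    choice = {s: max(range(len(projects)), key=lambda n: Z[n][s[n]])
              for s in joint_states}
    value = {s: 0.0 for s in joint_states}
    for _ in range(sweeps):
        new_value = {}
        for s in joint_states:
            n = choice[s]
            p, i = projects[n], s[n]
            expected = sum(p["P"][i][j] * value[s[:n] + (j,) + s[n + 1:]]
                           for j in range(sizes[n]))
            new_value[s] = p["R"][i] + beta * expected
        value = new_value
    return value, choice

# Hypothetical projects and hand-computed reservation prices (see lead-in).
risky = {"R": [-1.0, 0.0, 10.0],
         "P": [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]}
safe = {"R": [3.0], "P": [[1.0]]}
Z = [[80.0, 0.0, 100.0], [30.0]]

value, choice = index_policy_value([risky, safe], Z, beta=0.9)
print(choice[(0, 0)], round(value[(0, 0)], 6))
```

In this example the rule tries the risky project first, stays with it if it turns out well, and otherwise falls back on the safe project. Its value from the initial state is 57.5, against 54.75 for starting with the safe project and 30 for never touching the risky one.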


Proof  of  the  Basic  Theorem 

Henceforth we suppress the cumbersome notation i(n) where its
use is superfluous. Unless otherwise noted, project n is in state i(n).

Throughout the proof it will be convenient to work with an
equivalent undiscounted problem where $1 - \beta_i^n$ is now interpreted as the
probability that, if project n in state i is chosen, everything stops
and the entire process ends (with zero reward). Transition probabilities
are then

$$q_{ij}^n = \beta_i^n P_{ij}^n.$$

The two formulations are mathematically identical, but the
interpretation of the equivalent undiscounted problem is easier.

Let $G_i^n(Z)$ be the probability that $Z^n$ will eventually fall to or
below Z if project n is continued forever starting from state i. More
formally,

$$G_i^n(Z) = 1 \quad \text{if } Z_i^n \le Z$$

$$G_i^n(Z) = \sum_j q_{ij}^n G_j^n(Z) \quad \text{if } Z_i^n > Z$$

An interesting relation which we will not use directly is

$$\frac{dV_i^n(Z)}{dZ} = G_i^n(Z) \quad \text{a.e.}$$
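
As an illustrative check of this relation, consider a hypothetical box (the numbers are assumptions chosen for this sketch) that costs C = 1 to open, has discount factor 0.9, and contains a reward of 0 or 10 with equal probability. In bandit notation the closed box is state 0 and the opened box is in state 1 (reward 0) or state 2 (reward 10), with reservation prices $Z_1 = 0$, $Z_2 = 10$, and $Z_0 = 70/11$ from (5). For any fallback Z strictly between 0 and 70/11, with the project superscript suppressed:

```latex
\begin{align*}
V_0(Z) &= -1 + 0.9\,\bigl(0.5\,\max\{Z, 0\} + 0.5\,\max\{Z, 10\}\bigr)
        = 3.5 + 0.45\,Z,
  \qquad \frac{dV_0(Z)}{dZ} = 0.45, \\
G_1(Z) &= 1 \quad (\text{since } Z_1 = 0 \le Z), \\
G_2(Z) &= q_{22}\,G_2(Z) = 0 \quad (\text{since } Z_2 = 10 > Z
          \text{ and } q_{22} = \beta_2 P_{22} = 0), \\
G_0(Z) &= q_{01}\,G_1(Z) + q_{02}\,G_2(Z) = 0.45 \cdot 1 + 0.45 \cdot 0 = 0.45 .
\end{align*}
```

The slope of $V_0$ and the probability $G_0$ agree on this interval, as the relation asserts.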

At any stage let A(Z) be a "continuation set" of projects whose
reservation prices are greater than Z. More formally,

$$Z^n > Z \iff n \in A.$$

Define the function W(Z) to be the maximum expected present discounted
value of a bandit-like process played under the following conditions:

stopping rule: retire the process and collect Z if and only
if A is empty.

selection restriction: the next project to be selected must
belong to A.

It is not difficult to prove that W(Z) is a continuous function.

Analogously, define $W^n(Z)$ to be the value of an optimal policy under
the above rules with the overriding constraint, only initially operative,
that project n must be selected first. Then

$$W(Z) = Z \quad \text{if } A \text{ is empty}$$

$$W(Z) = \max_{n \in A} W^n(Z) \quad \text{otherwise.}$$

Note that when $Z = -\infty$, in effect there are no restrictions and
$W(-\infty)$ is the value of an optimal policy in the original problem. The
theorem will be proved if we can show that

$$W(-\infty) = W^1(-\infty),$$

where without loss of generality the projects are so ordered that

$$Z^1 = \max_n Z^n.$$
Lemma 1:

$$\frac{dW}{dZ} = \prod_{n=1}^{N} G^n(Z) \quad \text{a.e.}$$

Proof: In whatever order they are used, projects are abandoned
if and only if their reservation prices fall to Z or below. The
abandoning of each project is an independent stochastic event. The
probability that the entire process is retired and Z is collected is
therefore

$$\prod_{n=1}^{N} G^n(Z).$$

In the problem as we have formulated it, there are essentially
a finite number of reservation prices. Suppose $Z \ne Z_i^n$ for all i and n.
Define

$$\underline{Z} = \max_{Z_i^n < Z} Z_i^n, \qquad \bar{Z} = \min_{Z_i^n > Z} Z_i^n.$$

Let X and Y be any two points in the interval $(\underline{Z}, \bar{Z})$. Then
A(X) = A(Y) and $G^n(X) = G^n(Y)$ under all possible states of the system.

Consider a policy identical to the optimal policy for X, except
that Y is collected instead of X. This policy is feasible for Y,
hence

$$W(Y) \ge W(X) + (Y - X) \prod_{n=1}^{N} G^n(X).$$

By a symmetric argument,

$$W(X) \ge W(Y) + (X - Y) \prod_{n=1}^{N} G^n(Y).$$

Thus, the function W is linear in the interval $(\underline{Z}, \bar{Z})$ and
piecewise linear in $(-\infty, \infty)$. Except for (a finite number of) policy
switch points, the slope of W(Z) is given by $\prod_{n=1}^{N} G^n(Z)$.


Lemma 2:

$$\frac{dW}{dZ} = \frac{dW^1}{dZ}$$

for almost all $Z < Z^1$.

Proof: That

$$\frac{dW^1}{dZ} = \prod_{n=1}^{N} G^n(Z)$$

for $Z < Z^1$ follows by the same logic as Lemma 1.

Lemma 3:

$$W(Z^1) = W^1(Z^1).$$

Proof: Follows from the relevant definitions.

Theorem:

$$W(-\infty) = W^1(-\infty)$$

Proof: Follows from Lemmas 2 and 3.

Computation

For project n, we need to calculate $\{V_{ji}^n\}$ only for that state i(n)
currently occupied. Once $V_{ii}^n$ is determined, further calculations for
project n are unnecessary until and unless project n is actually selected.

The equations (7), (8) are decomposable by project n and by originating
state i, so that the calculations of $\{V_{ji}^n\}$ with j varying, i and n
fixed, can be done independently of other i and n. Without loss of
generality, therefore, we suppress indices i, n and show how to calculate
the solution to the system

$$V_i = R_i + \beta_i \sum_j P_{ij} V_j \tag{10}$$

$$V_j = \max\Bigl\{ V_i,\; R_j + \beta_j \sum_k P_{jk} V_k \Bigr\} \qquad \forall j \ne i \tag{11}$$

where

$$V_i = V_{ii}^n, \qquad V_j = V_{ji}^n.$$

The system of equations (10), (11) is of a classical form familiar
from input-output theory with variable techniques, or Markov chain theory
with alternative policies.¹¹ Two basic solution methods are commonly
available.

¹¹ See, e.g., Weitzman [1967] or Howard [1960].

Perhaps the easiest algorithm is successive approximations. The
iterations

$$V_i(t+1) = R_i + \beta_i \sum_j P_{ij} V_j(t) \tag{12}$$

$$V_j(t+1) = \max\Bigl\{ V_i(t),\; R_j + \beta_j \sum_k P_{jk} V_k(t) \Bigr\} \qquad j \ne i \tag{13}$$

converge to the solution of (10), (11) from any initial $V_i(0)$, $\{V_j(0)\}$.

Problems of immense dimensions could be calculated because the
principal effective constraint on (12), (13) is not computation, but
storage size.
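
The successive approximations (12), (13) translate almost line for line into array operations. The following Python sketch is one possible rendering under assumptions: the name `reservation_price`, the zero starting values, the stopping tolerance, and the example box (cost 1, discount factor 0.9, reward 0 or 10 with equal probability) are all hypothetical.

```python
import numpy as np

def reservation_price(R, beta, P, i, tol=1e-12, max_iter=100_000):
    """Successive approximations (12), (13) for originating state i.

    Returns the limit of V_i(t), which the text identifies with the
    reservation price of state i.
    """
    R, beta, P = (np.asarray(a, dtype=float) for a in (R, beta, P))
    V = np.zeros(len(R))
    for _ in range(max_iter):
        continuation = R + beta * (P @ V)        # R_j + beta_j * sum_k P_jk V_k(t)
        V_new = np.maximum(V[i], continuation)   # (13): states j != i
        V_new[i] = continuation[i]               # (12): the originating state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new[i]
        V = V_new
    raise RuntimeError("no convergence within max_iter")

# Hypothetical box in bandit form: state 0 closed, states 1 and 2 opened.
R = [-1.0, 0.0, 10.0]
beta = [0.9, 0.0, 0.0]
P = [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

prices = [reservation_price(R, beta, P, i) for i in range(3)]
print(prices)
```

For this box the closed state has reservation price 70/11 (about 6.36), and each opened state has a reservation price equal to the reward it holds. Only the current value vector is kept in memory between iterations.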

Furthermore, after project n has been selected and has moved from
state i to state k, the previously calculated $\{V_{ji}^n\}$ are natural starting
values for $\{V_{jk}^n(0)\}$, the initial approximations for $\{V_{jk}^n\}$ under the new
state k.

An  alternative  approach  is  policy  iteration.   Starting  with  a  prescribed 
policy in each state j ≠ i of choosing either to continue or to stop,
we calculate the solution to the (now linear) set of value equations to
which (10), (11) reduces. These are used to select a new policy on the
basis of value maximization, which defines another step of the iteration.
The advantage of this approach is finite, speedy convergence. The
disadvantage, considerable in large problems, is that a matrix must
be inverted at each iteration.

More exotic approaches, like fixed point algorithms, could in
principle be employed.

Whatever methods are used to solve (10), (11), we note the
computational superiority of the reservation price approach over
traditional dynamic programming. If there are N projects, each with
I states, there are at most $NI^2$ numbers to be calculated by the
present approach. By the traditional backwards recursion dynamic
programming approach of equation (3), there are at most $I^N$ numbers to
be determined. If I = N = 10, this is a difference between computing
one thousand numbers and ten billion numbers!

Applications 

(1)   Search 

Studying simple box search is a very useful way to understand the
basic features of bandit process solutions because the equation defining
the reservation price of a simple box

$$Z = -C + \beta \sum_i q_i \max\{Z, Y_i\} \tag{14}$$

is a miniature version of the bandit process reservation price equations
(10), (11).

From  (14) ,  the  reservation  price  of  a  box  is  completely  insensitive 
to  the  probability  distribution  of  rewards  at  the  lower  end  of  the  tail. 
Any  rearrangement  of  the  probability  mass  located  below  Z  leaves  Z  unaltered. 
It  is  important  to  understand  this  feature.   Considering  that  a  box  could 
be  opened  at  any  time,  the  only  rationale  for  opening  it  now  is  the 
possibility  of  drawing  a  relatively  high  reward.   That  is  why  the  lower 
end  of  its  reward  distribution  is  irrelevant  to  the  order  in  which  box  i 
should be sampled even though it may well influence the value of an
optimal  policy. 
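
The insensitivity to the lower tail can be checked numerically. The
sketch below solves the box equation, read as Z = -C + β Σ q_i
max{Z, Y_i}, by bisection. The function name, the bracketing choice, and
the two reward distributions are illustrative assumptions, not taken
from the paper.

```python
def box_reservation_price(cost, beta, rewards, probs, tol=1e-12):
    """Solve Z = -cost + beta * sum_i q_i * max(Z, Y_i) by bisection.

    Assumes cost >= 0, 0 < beta <= 1, and not (cost == 0 and beta == 1),
    so the residual below is decreasing in Z and has a unique root.
    """
    def residual(z):
        expected = sum(q * max(z, y) for q, y in zip(probs, rewards))
        return -cost + beta * expected - z

    mean = sum(q * y for q, y in zip(probs, rewards))
    lo = min(min(rewards), beta * mean - cost)  # residual(lo) >= 0
    hi = max(max(rewards), 0.0)                 # residual(hi) <= 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical box: half of the probability mass lies below Z.
z_a = box_reservation_price(5.0, 0.9, [0, 50, 100], [0.5, 0.3, 0.2])
# Same upper part; the lower half of the mass is rearranged below Z.
z_b = box_reservation_price(5.0, 0.9, [-20, 40, 50, 100],
                            [0.25, 0.25, 0.3, 0.2])
print(z_a, z_b)
```

Both boxes have the same reservation price even though their expected
rewards differ, because only the rewards above Z enter the equation
through their own values.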

On  the  other  hand,  as  rewards  become  more  dispersed  at  the  upper 
end  of  the  distribution,  the  reservation  price  increases.   Other  things 
being  equal,  it  is  optimal  to  sample  first  from  distributions  which 
are  more  spread  out  or  riskier  in  hopes  of  striking  it  rich  early. 
This  is  a  major  conclusion.   Low-probability  high-payoff 
situations  should  be  prime  candidates  for  early  investigation  even 
though  they  may  have  a  smaller  chance  of  ending  up  as  the  source 
ultimately  yielding  the  maximum  reward  when  search  ends. 

The standard comparative statics exercises performed on (14) yield
anticipated results. Reservation price decreases with greater search
cost, increased search time, or a higher interest rate. Moving the
probability mass of rewards to the right makes Z larger. Thus although
there is no necessary connection between the mean reward and the
reservation price, there is a well-defined sense in which higher rewards
increase the reservation price. Similarly, performing a mean preserving
spread on the distribution function makes Z bigger. In this sense a
riskier distribution of rewards implies a higher reservation price.
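
These comparative statics can be illustrated on a two-point reward
distribution. The sketch below is not the paper's algorithm; it simply
iterates the box equation as a fixed point, which is a contraction when
the discount factor is below one. All numbers are hypothetical.

```python
def reservation_price(cost, beta, rewards, probs,
                      tol=1e-12, max_iter=100_000):
    """Iterate Z <- -cost + beta * E[max(Z, Y)] until it settles."""
    z = 0.0
    for _ in range(max_iter):
        expected = sum(q * max(z, y) for q, y in zip(probs, rewards))
        z_next = -cost + beta * expected
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    raise RuntimeError("no convergence; check that beta < 1")


half = [0.5, 0.5]
base = reservation_price(5.0, 0.9, [25, 75], half)
higher_cost = reservation_price(10.0, 0.9, [25, 75], half)
lower_beta = reservation_price(5.0, 0.8, [25, 75], half)  # higher interest
shifted = reservation_price(5.0, 0.9, [35, 85], half)     # mass moved right
spread = reservation_price(5.0, 0.9, [0, 100], half)      # same mean, riskier
print(base, higher_cost, lower_beta, shifted, spread)
```

In this example a higher search cost or a lower discount factor lowers
Z, while shifting rewards to the right or spreading them around the
same mean raises it.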

(2)   Scheduling 

The reservation prices of the deterministic resource pool problem¹² are

$$Z_i^n = -\frac{1}{1-\alpha} \inf_{T \ge i} \frac{\sum_{t=i}^{T} C_t^n \alpha^{t-i}}{\sum_{t=i}^{T} \alpha^{t-i}} \qquad (15)$$

12

See Weitzman [1976].

In the present context $\alpha^{t-i}$ means $\alpha$ multiplied by itself
t-i times.

The reservation price of a pool is (a negative multiple of)
the minimum equivalent stationary cost per barrel of oil from that
source, which we might call the implicit cost of the pool.

Converting arbitrary cost streams to stationary equivalents for
the purpose of finding the cheapest alternative is an old economist's
trick. The optimality of the max Z rule can in a sense be interpreted
as justifying this heuristic procedure under certain conditions.

Note  that  the  implicit  cost  of  a  pool  reduces  to  its  marginal  cost 
for  the  special  case  of  nondecreasing  costs. 

At the opposite extreme, with decreasing costs over the entire
range, the infimum in (15) is obtained for T = ∞. Once an infinite
capacity non-increasing cost source is opened up, in an optimal policy
it should operate forever.
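
For a finite cost stream the implicit cost can be computed by a single
pass over horizons. The sketch assumes (15) is read as the minimum,
over horizons T at or beyond the current state i, of discounted cost
divided by discounted output. The function names and cost figures are
illustrative only.

```python
def implicit_cost(costs, alpha, i=0):
    """Return (lowest discounted average cost, horizon T attaining it).

    costs[t] is the cost of unit t; only horizons T >= i are considered.
    """
    best, best_T = float("inf"), None
    num = den = 0.0
    weight = 1.0  # alpha ** (t - i)
    for T in range(i, len(costs)):
        num += costs[T] * weight
        den += weight
        weight *= alpha
        if num / den < best:
            best, best_T = num / den, T
    return best, best_T


def pool_reservation_price(costs, alpha, i=0):
    """Z_i = -(implicit cost) / (1 - alpha)."""
    return -implicit_cost(costs, alpha, i)[0] / (1.0 - alpha)


rising = implicit_cost([2, 3, 5], 0.9)   # nondecreasing costs
falling = implicit_cost([5, 3, 2], 0.9)  # decreasing costs
print(rising, falling)
```

With nondecreasing costs the minimum is attained at T = i, so the
implicit cost equals the current marginal cost. With decreasing costs
the minimum is attained at the last available unit, the finite analogue
of T = ∞.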

To  illustrate  the  typical  form  of  an  optimal  policy  for  exploiting 
depletable  natural   resources,  consider  the  following  simple  example. 
Suppose  there  are  but  two  resource  pools.   Each  pool  has  an  initial 
range  of  decreasing  costs,  followed  by  a  final  section  of  increasing 
costs.   Let  the  pools  be  ordered  so  that  the  first  has  lower  implicit 
cost  than  the  second.   The  optimal  strategy  will  be  to  initially  exploit 
the  pool  with  the  lowest  implicit  cost,  pool  number  one.   This  will  be 
done  until  the  marginal  cost  of  extracting  one  more  unit  (in  the  increasing 
cost  range)  becomes  greater  than  the  implicit  cost  of  the  second  pool. 
At  that  time  pool  two  will  start  being  exploited,  and  it  will  be  the 
exclusive  source  until  its  marginal  cost  in  the  increasing  cost  range 
becomes  greater  than  the  marginal  cost  of  pool  one.   Then  pool  one  or 
two  will  alternately  be  exploited,  depending  on  which  is  currently  the 
cheaper  source  at  the  margin. 
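
The two-pool pattern can be reproduced with a small simulation. This is
a sketch under assumptions: the cost streams are hypothetical, each
pool is finite, and at every step the pool with the lowest implicit
cost (equivalently the highest reservation price) is exploited.

```python
def implicit_cost(costs, alpha, i):
    """Lowest discounted average cost over horizons T >= i.

    Returns infinity once the pool is exhausted.
    """
    best, num, den, weight = float("inf"), 0.0, 0.0, 1.0
    for cost in costs[i:]:
        num += cost * weight
        den += weight
        weight *= alpha
        best = min(best, num / den)
    return best


def exploit(pools, alpha):
    """Extract every unit, always from the pool with lowest implicit cost."""
    state = {name: 0 for name in pools}
    order = []
    for _ in range(sum(len(c) for c in pools.values())):
        name = min(pools,
                   key=lambda n: implicit_cost(pools[n], alpha, state[n]))
        order.append(name)
        state[name] += 1
    return order


# Each pool: decreasing costs first, then increasing costs.
pools = {1: [5, 3, 2, 4, 6, 8], 2: [6, 4, 3, 5, 7, 9]}
order = exploit(pools, alpha=0.9)
print(order)
```

In this example pool one is worked until its marginal cost exceeds the
implicit cost of pool two, pool two is then the exclusive source for a
while, and finally the two pools alternate according to marginal cost.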

Reservation prices for stochastic resource pool problems are often
easy to calculate and typically are interpretable as a probabilistic
version of (15). For example, if it costs a fixed overhead charge of
$k^n$ to open pool n and a variable cost $C^n$ per unit extracted
thereafter, but reserves are a random variable T, then

$$Z_0^n = -\frac{1}{1-\alpha} \cdot \frac{k^n + C^n E \sum_{t=0}^{T} \alpha^t}{E \sum_{t=0}^{T} \alpha^t}$$

before the pool is opened (state i = 0). For $i \ge 1$,

$$Z_i^n = -\frac{C^n}{1-\alpha}$$

after  the  pool  is  opened.   With  the  above  specification,  once  a  pool  is 
tapped  in  an  optimal  policy  it  is  run  until  dry. 
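
A numerical sketch of these two reservation prices follows. It assumes
the discounted sum runs from t = 0 to T and that reserves follow a
discrete distribution; the distribution and cost figures are
hypothetical.

```python
def expected_discounted_output(alpha, reserves):
    """E[sum_{t=0}^{T} alpha**t] for reserves given as {T: probability}."""
    return sum(p * (1.0 - alpha ** (T + 1)) / (1.0 - alpha)
               for T, p in reserves.items())


def z_unopened(overhead, unit_cost, alpha, reserves):
    """Reservation price before the overhead charge is paid (state 0)."""
    s = expected_discounted_output(alpha, reserves)
    return -(overhead + unit_cost * s) / ((1.0 - alpha) * s)


def z_opened(unit_cost, alpha):
    """Reservation price once the pool is open (state i >= 1)."""
    return -unit_cost / (1.0 - alpha)


reserves = {4: 0.5, 9: 0.5}  # hypothetical distribution of T
before = z_unopened(10.0, 2.0, 0.9, reserves)
after = z_opened(2.0, 0.9)
print(before, after)
```

Because the overhead is sunk once the pool is open, the reservation
price jumps up on opening and stays there, which is consistent with a
tapped pool being run until dry.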

Note  that  for  a  situation  where  all  pools  are  the  same  and  there  is 
an  unlimited  collection  of  them,  the  optimal  policy  will  be  cyclic  or 
recursive.   The  same  conclusion  holds  if  there  are  several  classes  of 
pools,  each  class  containing  an  infinite  number  of  identical  pools 
(because  in  an  optimal  strategy  only  pools  from  one  class  will  be 
tapped).   This  is  why  so  many  of  the  standard  operations  research  models 
with  stationary  probability  distributions  (for  example,  equipment 
durability  selection  and  replacement,  inventory,  maintenance,  and 
production  scheduling,  or  capacity  expansion)  end  up  having  a  repetitive 
solution  which  may  be  universally  characterized  as  follows.   At  each 
decision  node,  choose  the  strategy  element  with  lowest  expected 
equivalent  stationary  cost. 

(3)   Learning

In simple box search we observed that if rewards are more spread
out, or in other words the probability distribution collapses more
completely when drawing a sample from it, the reservation price is
increased. This principle generalizes.

The  possibility  of  learning  increases  the  reservation  price  of  a 
project  by  a  premium  reflecting  how  rapidly  the  probability  spread 
of  rewards  narrows  as  more  steps  of  the  project  are  undertaken.   This 
is  a  crucial  feature  of  information  gathering  processes. 

The learning effect is quite pivotal in the numerical R&D example.
Before the feasibility study, project A with probability .8 has a .625
chance of success and with probability .2 has a 0 chance of success;
likewise project B with probability .5 has a .8 chance of success
and with probability .5 has a 0 chance of success. Viewed as a binomial
event, the pre-feasibility variance of A is (.5)(.5) = .25, and of
B is (.4)(.6) = .24. The expected post-feasibility variance of A is
(.8)(.625)(.375) = .19, and of B is (.5)(.8)(.2) = .08. On average,
the feasibility study reduces the variance of A by only .06 compared
with .16 for B. The learning or information effect is strong enough to
favor starting with B, which is an inferior project to A in all other
respects. If in the first stage (feasibility study) the cost is $C_1$
and the probability of success is $P_1$, and in the second stage (full
scale R&D) the cost is $C_2$ and the probability of success conditional
upon passing stage 1 is $P_2$, then $Z = -(C_1 + P_1 C_2)/(P_1 P_2)$.
The reader can verify $Z^B > Z^A$.
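
The variance arithmetic in this example can be reproduced in a few
lines. The sketch uses only the probabilities quoted above; the function
and variable names are ours.

```python
def bernoulli_variance(p):
    """Variance of a success/failure event with success probability p."""
    return p * (1.0 - p)


def variance_summary(pass_prob, success_given_pass):
    """Return (pre-study variance, expected post-study variance, reduction).

    If the feasibility study fails, the success chance is 0 and so is
    the remaining variance, so only the passing branch contributes.
    """
    pre = bernoulli_variance(pass_prob * success_given_pass)
    post = pass_prob * bernoulli_variance(success_given_pass)
    return pre, post, pre - post


projects = {"A": (0.8, 0.625), "B": (0.5, 0.8)}
summary = {name: variance_summary(*params)
           for name, params in projects.items()}
for name, (pre, post, drop) in summary.items():
    print(name, round(pre, 4), round(post, 4), round(drop, 4))
```

Unrounded, the expected post-feasibility variance of A is .1875, so the
reduction for A is .0625 against .16 for B.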

With the multi-armed bandit problem, there is a comparable learning
effect favoring arms with more diffuse priors. Mean preserving spreads
of bandit arm priors increase the reservation price. On average, the
reservation price of an arm is expected to decline over time, as the
probability distribution for the arm contracts after sampling. There
will of course be instances when the decision maker should sample a
high-variance, low-mean arm even though he knows he is likely to abandon
it in favor of a low-variance, high-mean arm.

From the analysis of theoretical models, new insights are possible
into the properties of the R&D search process as a whole. Examination
of how reservation prices tend to change with the development of a line
of research allows one to describe the way in which research can be
expected to proceed.

To take an example, it is possible to infer from the Roberts/Weitzman
analysis of a single research project that the reservation price can be
expected to fall over time when the uncertainty about ultimate rewards
is high in comparison to the costs required to complete the research
project. If a planner is facing a situation with several projects of
this type, he can expect the optimal selection rule to involve considerable
switching and reswitching between projects. The real world implication
is that in such situations it may be optimal to pursue a parallel research
effort with more than one project being developed at the same time. By
contrast, when the uncertainty about ultimate rewards is low relative to
R&D costs, the reservation price of a project can be expected to rise
because as more stages are developed, total remaining costs to completion
are lowered, without much gain in information. In a situation
with several projects of this type, the optimal selection rule will
tend to go with one "best" project from beginning to end.

(4)   A  Composite  Example 

Most  bandit  processes  exhibit  features  common  to  more  than  one 
"pure  case".   Consider,  for  example,  the  following  stylized  mining 
problem  which  illustrates  nicely  the  interacting  of  search,  scheduling, 
and  learning. 

A company seeks to extract a natural resource at a fixed predetermined
flow rate from any one at a time of a number of different
potential mine locations.

For  notational  convenience  we  drop  the  superscript  referring  to 
mine  location. 

If  a  given  location  contains  ore,  it  contains  enough  to  last  for 
T  years  at  the  fixed  extraction  rate,  where  T  is  a  random  variable  with 
a  known  distribution.   The  initial  overhead  cost  of  opening  the  mine 
would  be  K,  and  the  operating  cost  would  be  C  per  year. 

By paying a (relatively inexpensive) testing cost of $C_1$, the company
can perform a geological survey which, with probability $1-p_1$, will rule
the site out as altogether implausible. If the survey is favorable,
by paying a (relatively expensive) cost of $C_2$, the company can do a test
drilling which will strike ore with probability $p_2$.

The  interest  rate  is  r.   What  is  the  next  site  to  explore? 

Applying the formula for $Z_0$, we can derive

$$Z_0 = -\frac{C_1 + p_1 C_2 + p_1 p_2 \left(K + (C/r)(1 - E e^{-rT})\right)}{p_1 p_2 (1 - E e^{-rT})}$$

With $C_1 = C_2 = 0$, this is a pure scheduling problem; with $C_1 = 0$,
$C_2 > 0$, a search aspect is tacked on; with $C_1 > 0$, $C_2 > 0$, a
learning stage is added.

Without a first learning stage, the value of Z would be

$$-\frac{C_2 + p_1 p_2 \left(K + (C/r)(1 - E e^{-rT})\right)}{p_1 p_2 (1 - E e^{-rT})}$$

Thus, provided $C_1 < (1-p_1) C_2$, adding the possibility of a geological
survey makes it more likely a decision maker will want to investigate
a site, even though it increases overall cost if the site contains ore.

The strength of this effect increases as $p_1$ is smaller. Holding
other things constant, including the overall probability that the site
contains ore $p_1 p_2$, a decrease in $p_1$ means an increase in $p_2$.
The less likely is the geological survey to be successful, the more
discriminating power does it have, and the more desirable is it to
investigate the site now because the relatively expensive test drilling
is less likely to be in vain.
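
The effect of the survey can be checked numerically. The sketch below
takes the reservation price of a site to be minus the expected
discounted cost divided by p1 p2 (1 - E exp(-rT)), with the survey cost
C1 either present or absent. All parameter values and the lifetime
distribution are hypothetical.

```python
import math


def site_reservation_price(c_survey, c_drill, p1, p2,
                           K, C, r, lifetimes, survey=True):
    """Reservation price of a site; lifetimes is {T in years: probability}.

    With survey=False the drilling cost is paid unconditionally and no
    survey cost is incurred.
    """
    d = 1.0 - sum(p * math.exp(-r * T) for T, p in lifetimes.items())
    mining = p1 * p2 * (K + (C / r) * d)
    testing = c_survey + p1 * c_drill if survey else c_drill
    return -(testing + mining) / (p1 * p2 * d)


common = dict(K=50.0, C=4.0, r=0.05, lifetimes={10: 0.5, 30: 0.5})
with_survey = site_reservation_price(1.0, 10.0, 0.4, 0.5, **common)
no_survey = site_reservation_price(1.0, 10.0, 0.4, 0.5,
                                   survey=False, **common)
gain = with_survey - no_survey
# Same overall ore probability p1 * p2 = 0.2, more discriminating survey.
sharper_gain = (
    site_reservation_price(1.0, 10.0, 0.2, 1.0, **common)
    - site_reservation_price(1.0, 10.0, 0.2, 1.0, survey=False, **common)
)
print(with_survey, no_survey, gain, sharper_gain)
```

Here C1 = 1 is below (1 - p1) C2 = 6, so the survey raises the
reservation price, and the gain is larger when p1 is smaller at an
unchanged overall probability of ore.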

(5)   Irreversible  Investment  and  The  Option  Value  of  Flexibility 

The  following  stylized  example  shows  the  value  of  flexibility  in 
determining  reservation  prices. 


13 

For background, see Henry [1974] and the references cited there.

Suppose that an irreplaceable asset (a forest, say) can be put to
some number of alternative uses. When used for a given purpose, the
annual (imputed) income of the forest follows a random walk from current
value s, with zero drift and annual variance $\sigma^2$. (For notational
convenience we drop the superscript referencing use option.) An
irreversible usage could be represented by $\sigma^2 = 0$. The policy
question is which usage to favor at the current time.

In our terminology, s is the state of a usage and U(s;Z) is the
expected present discounted value of an optimal policy if the only
alternative to the proposed usage is irreversibly cutting down the
forest for a present discounted reward of Z. Let Z(s) be the reservation
price of the usage under consideration. If r is the instantaneous
interest rate, for Z < Z(s) there is a sufficiently small $\delta t$
such that

$$U(s;Z) = s\,\delta t + (1 - r\,\delta t)\, E\, U(s + X\sigma\sqrt{\delta t};\, Z)$$

where X is a random variable taking on the values +1 and -1 each with
probability 1/2.

Employing Taylor series approximations and passing to the limit
yields the differential equation¹⁴

$$\frac{\sigma^2}{2} U_{ss} - rU + s = 0$$

for Z < Z(s).


14 

The  general  technique  being  used  is  explained  in  greater  detail  in 

Roberts  and  Weitzman  [1979].   Here  we  are  more  concerned  with  presenting 

and  interpreting  a  new  result  than  deriving  it  rigorously. 

At Z = Z(s) we have the condition

$$U(s; Z(s)) = Z(s).$$

Also, for sufficiently small $\delta t$,

$$Z(s) = s\,\delta t + (1 - r\,\delta t)\left[\tfrac{1}{2} U(s + \sigma\sqrt{\delta t};\, Z(s)) + \tfrac{1}{2} Z(s)\right],$$

which yields, passing to the limit after a Taylor expansion,

$$U_s(s; Z(s)) = 0.$$

The differential equation, along with the boundary conditions,
can be explicitly solved to yield the formula

$$Z(s) = \frac{1}{r}\left(s + \frac{\sigma}{\sqrt{2r}}\right).$$

The reservation price Z(s) is the sum of the "certainty equivalent"

$$\frac{s}{r}$$

plus the "option value"

$$\frac{\sigma}{r\sqrt{2r}}.$$

The  option  value  of  a  given  flexible  usage  measures  its  incremental 
worth  over  the  hypothetical  irreversible  alternative  of  receiving  its 
current  certainty  equivalent  income  forever.   The  option  value  is  directly 
proportional to $\sigma$, which parameterizes the uncertainty in the difference
between  the  flexible  and  irreversible  options. 
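
One way to recover the closed form from the differential equation and
the two boundary conditions is sketched below. It assumes that the
growing exponential term is excluded, so that U stays close to s/r for
large s; the constant B depends on Z, and the last two lines are
evaluated at the state s where Z = Z(s).

```latex
% Continuation region: (\sigma^2/2) U_{ss} - rU + s = 0.
% Particular solution s/r plus the decaying homogeneous term.
\begin{align*}
U(s;Z) &= \frac{s}{r} + B e^{-\lambda s},
  \qquad \lambda = \frac{\sqrt{2r}}{\sigma}, \\
U_s(s;Z(s)) = 0
  &\;\Longrightarrow\;
  B e^{-\lambda s} = \frac{1}{r\lambda} = \frac{\sigma}{r\sqrt{2r}}, \\
U(s;Z(s)) = Z(s)
  &\;\Longrightarrow\;
  Z(s) = \frac{s}{r} + \frac{\sigma}{r\sqrt{2r}}.
\end{align*}
```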

To give some idea of the orders of magnitude involved, suppose
s = $1 million, σ = $100 thousand, r = 5%. Then the certainty equivalent
is $20 million, whereas the reservation price is $26.3 million. The
option value, here 32% of the certainty equivalent, can easily be a
non-negligible component of cost-benefit analysis.
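
The orders of magnitude can be recomputed from the closed form
Z(s) = (1/r)(s + σ/√(2r)). The sketch below only evaluates that
expression for the stated parameters; the function names are ours.

```python
import math


def certainty_equivalent(s, r):
    """Present value of receiving income s forever: s / r."""
    return s / r


def option_value(sigma, r):
    """Premium for flexibility: sigma / (r * sqrt(2 r))."""
    return sigma / (r * math.sqrt(2.0 * r))


s, sigma, r = 1_000_000.0, 100_000.0, 0.05
ce = certainty_equivalent(s, r)
ov = option_value(sigma, r)
z = ce + ov
share = ov / ce
print(f"certainty equivalent: {ce:,.0f}")
print(f"option value:         {ov:,.0f}")
print(f"reservation price:    {z:,.0f}")
print(f"option value share:   {share:.1%}")
```

The option value scales linearly with σ, so doubling the annual
standard deviation doubles the premium while leaving the certainty
equivalent unchanged.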